-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce 'LOOKUP' Transform Function #6383
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6383 +/- ##
==========================================
- Coverage 66.44% 65.49% -0.96%
==========================================
Files 1075 1315 +240
Lines 54773 63797 +9024
Branches 8168 9293 +1125
==========================================
+ Hits 36396 41781 +5385
- Misses 15700 19034 +3334
- Partials 2677 2982 +305
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
*/ | ||
public class LookupTransformFunction extends BaseTransformFunction { | ||
public static final String FUNCTION_NAME = "lookUp"; | ||
private static final String TABLE_NAME_SUFFIX = "_OFFLINE"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to assume a dim table is always an offline table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
pkColumns[i] = ArrayUtils.toObject(tf.transformToLongValuesSV(projectionBlock)); | ||
break; | ||
default: | ||
throw new IllegalStateException("Unknown column type for primary key"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about byte?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to support all types: INT, LONG, FLOAT, DOUBLE, STRING, BYTES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi, double is hard to check equality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For lookupByPrimaryKey we are relying on the PrimaryKey.hashCode implementation you added here. Here is the dimension table HashMap which is keyed by 'PrimaryKey's. Let me know if you see any potential issues with this usage.
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
* Add DimensionTableData manager * Address review comments. * CLose reader after using. * Revisit javadocs. * Release segment after use. * Touch up instance instantiation. * Cleanup segment in test. * Release segments in "finally" block. * Update logs. * Add TableConfig validations for Dim Tables. * Seperate IngestionConfigTests for dim tables. * Remove defensive null checks. * Fix github action profile name. * Fix ingestionTest dependencies. * Undo the gihub-actions mvn profile name fix.
...core/src/main/java/org/apache/pinot/core/data/manager/offline/DimensionTableDataManager.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java
Outdated
Show resolved
Hide resolved
break; | ||
case BYTES: | ||
byte[][] primitiveValues = tf.transformToBytesValuesSV(projectionBlock); | ||
pkColumns[i] = new Byte[numDocuments][]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkColumns
should be stored as ByteArray[numDocuments]
. Please add a test for all these data types to ensure them working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checked and can't see how this is wrong. What we are doing here is translating the output of the transform function from type byte[numDocuments][]
to Byte[numDocuments][]
so it can be passed back as Object[]
. Second []
is indicating that we simply have 'byte arrays' for each row/entry. I'm updating the loop index variable name to make it a bit more clear.
We already have test coverage for this behavior here (as well as other types).
Let me know if I missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The equals()
in PrimaryKey
won't do deep comparison for array, and Byte[]
won't be compared correctly (it will only compare the references). We use ByteArray
as a wrapper to bypass this problem. It is used to store byte[]
internally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see what you mean, great catch. Looks like the reason unit tests didn't catch this was, I was mocking the lookupRowByPrimaryKey
to match byte array type PKs by their string representation 🤦. Fixed it and revamped all the test cases to match only by the 'hashCode' of the PK instance. Thanks 👍
Object[] resultSet = new Object[numDocuments]; | ||
for (int i = 0; i < numDocuments; i++) { | ||
// prepare pk | ||
Object[] pkValues = new Object[numPkColumns]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reuse this Object[]
instead of creating a new one per iteration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do something like:
Object[] resultSet = new Object[numDocuments];
Object[] pkValues = new Object[numPkColumns];
for (int i = 0; i < numDocuments; i++) {
// prepare pk
for (int j = 0; j < numPkColumns; j++) {
pkValues[j] = pkColumns[j][i];
}
// lookup
GenericRow row = _dataManager.lookupRowByPrimaryKey(new PrimaryKey(pkValues));
if (row != null) {
resultSet[i] = row.getValue(_dimColumnName);
}
}
I don't see much point in doing the same for PrimaryKey
though. We will only be reusing the pointer which is not really helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will reuse the object instead of creating one per doc:
Object[] resultSet = new Object[numDocuments];
Object[] pkValues = new Object[numPkColumns];
PrimaryKey primaryKey = new PrimaryKey(pkValues);
for (int i = 0; i < numDocuments; i++) {
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will work. We have to create a new 'PrimaryKey' instance per document since the values (pkValues
) are going to be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can directly modify pkValues
without changing primaryKey
. They share the same reference.
This is not critical, so both ways are fine. It just saves minor garbages
pkValues[j] = pkColumns[j][i]; | ||
} | ||
// lookup | ||
GenericRow row = _dataManager.lookupRowByPrimaryKey(new PrimaryKey(pkValues)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrimaryKey
object can also be reused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replied above
Thanks for the review @Jackie-Jiang and @yupeng9 ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks
Description
Introducing a new transform function;
LookupTransformFunction
as a part of Join project as described in Lookup UDF Join In Pinot. This is a followup to the addition ofDimensionTableManager
in #6346.LOOKUP is a regular transform function which uses the previously added DimensionTableDataManager to execute a lookup from a Dimension table. Call signature is as follows:
TableName: name of the dimension table which will be used
ColumnName: column name from the dimension table to look up
JoinKey: primary key column name for the dimension table. Note: Only primary key is supported for JoinKey
JoinValue: primary key value
*If the dimension table has more then one primary keys (composite PK), you can add more keys and values for the rest of the args: JoinKey2, JoinValue2 ... etc.
Example Query:
Above example joins the dimension table 'baseballTeams' into regular table 'baseballStats' on 'teamID' key. Lookup function returns the value of the column 'teamName'.
To see the function in action you can also fire
JoinQuickstart
and test it as follows:Testing
JoinQuickstart
.Documentation